Best arm identification via Bayesian gap-based exploration 



Matthew W. Hoffman* 
Bobak Shahriari* 
Nando de Freitas 

University of British Columbia; Vancouver, BC, Canada 



HOFFMANM@CS.UBC.CA 
BSHAHR@CS.UBC.CA 
NANDO@CS.UBC.CA 



Abstract 

Bayesian approaches to optimization under 
bandit feedback have recently become quite 
popular in the machine learning community. 
Methods of this type have been found to 
have not only very good empirical perfor- 
mance, but also optimal theoretical regret 
bounds when analyzed from a frequentist 
perspective. In this work we study the- 
oretical, methodological, and empirical as- 
pects of the problem of best arm identifica- 
tion in stochastic multi-armed bandits from 
a Bayesian perspective. In particular, we in- 
troduce a Bayesian version of the gap-based 
method of (Gabillon et al., 2012). In the 
domain of sensor networks, with real traffic 
data, this approach shows significant gains in 
performance over both Bayesian cumulative 
regret techniques and frequentist simple re- 
gret methods. 

1. Introduction 

The problem of best arm identification in stochastic 
multi-armed bandit problems has recently received a 
great deal of theoretical attention (see Bubeck et al., 
2009; Audibert et al., 2010). As in more standard 
multi-armed bandit settings, this problem revolves 
around a decision maker who must repeatedly take ac- 
tions, i.e. by selecting an arm and observing a reward 
for pulling that arm; see for example (Berry & Frist- 
edt, 1985; Cesa-Bianchi & Lugosi, 2006; Gittins et al., 
2011) for extensive discussions. However, unlike more 
standard settings, the goal is not to maximize the cu- 
mulative sum of these observed rewards. Instead, the 
decision maker is allowed to interact with the bandit 
process during an exploration phase after which they 
are required to recommend a single arm; the decision 
maker is then judged only on the value of the single
arm that is recommended.

*Authors contributed equally.

An almost canonical example of the best arm identifi- 
cation problem, also known as pure exploration, is that 
of product testing. Take, for example, a company considering
different marketing strategies for their products.
The company might consider presenting these
different strategies to a subset of their potential cus- 
tomers. In this setting, the company is not necessarily 
interested in persuading those particular customers, 
but is instead concerned with the problem of finding 
the best strategy for selling products to their customer 
base at large. This initial, limited exploration phase 
serves as a proxy for the much larger set of customers. 
We can then ask the question of how best to query cus- 
tomers during this exploratory phase in order to have 
the highest probability of success in the testing phase 
when the marketing strategy is ultimately rolled out. 
In this same vein, the popular "A/B testing" frame- 
work, used for tailoring many of the design choices of 
modern web and mobile applications, can be seen as 
a problem of best arm identification (see Scott, 2010; 
Kohavi et al., 2009). 

Another area of bandit research that has received a 
great deal of attention recently involves incorporating 
Bayesian methods. Although some of the earliest work 
on what would now be called bandit problems came 
from a Bayesian perspective (Thompson, 1933), the 
field has since become dominated by frequentist ap- 
proaches based on regret minimization (Robbins, 1952; 
Lai & Robbins, 1985). See also the work of (Auer et al., 
2002). In the past few years, however, Bayesian methods
have seen something of a resurgence, partially
due to their strong empirical performance (Chapelle &
Li, 2012; Scott, 2010). More recent theoretical work 
has also shown that even while these methods take a 
Bayesian approach to modeling each arm, they still 
possess optimal, cumulative regret guarantees. See 
the work of (Kaufmann et al., 2012a) for bounds on 
a Bayesian version of the classical upper confidence 
bound (UCB) approach, as well as (Kaufmann et al.,
2012b) for bounds on an approach based on the original
work of Thompson. More recently, the work of (Russo
& Van Roy, 2013) has explicitly considered similar 
sample-based approaches with correlation between the 
arms. 

We would also be remiss if we did not mention 
Bayesian optimization. Although a full review of this 
area is beyond the scope of this work, it does share 
a great deal of overlap with bandit methodology. See 
the work of (Brochu et al., 2010) for an extensive tu- 
torial. An explicit link between Bayesian optimization 
and bandit methods has also been drawn in the work 
of (Srinivas et al., 2010), wherein the authors develop a 
UCB approach using Gaussian processes to model the 
reward distributions. The authors also provide bounds 
on the cumulative regret of this method. 

Up to this point, however, all the Bayesian methods 
we have mentioned focus on the problem of cumulative 
rewards. A natural question to ask then, is how to in- 
corporate Bayesian methodology into the problem of 
best arm identification — most of the literature on best 
arm identification approaches this problem from a frequentist
perspective. In the setting of Bayesian optimization,
the work of (Hennig & Schuler, 2012) formulates
the global optimization problem and describes
several techniques and approximations to make the in- 
ference problem tractable. In this work, however, the 
authors mainly focus on algorithmic contributions and 
do not provide any theoretical performance bounds. 
Our work features the first, to our knowledge, Bayesian 
approach targeting the problem of best arm identifica- 
tion with provable performance bounds with stochastic 
rewards. Alternatively, (de Freitas et al., 2012) show 
similar performance bounds to ours in the setting of 
Bayesian optimization with deterministic rewards. 

First, in Section 2 we formally introduce the prob- 
lem of best arm identification. Then, in Section 3, 
we introduce the method that we will use throughout 
the rest of this work, which builds on the gap-based 
exploration approach introduced in (Gabillon et al., 
2011; 2012). We provide a key generalization to the 
approach of Gabillon et al. which is crucial in deriving 
bounds for Bayesian models. Furthermore, our theo- 
rem subsumes the regret bounds for the frequentist 
methods based on Hoeffding and Empirical Bernstein 
bounds. Still in Section 3, we state Theorem 1 which 
bounds the regret in this general problem, and discuss 
its implications. The proof of this theorem, given in 
the appendix, follows that of the earlier work incorpo- 
rating our generalization. In Section 4 we introduce a 
Bayesian framework for gap-based exploration, which 
we call BayesGap. In particular, we consider the case
of linear-Gaussian arms and use the result of Section 3
to give a bound on this method's performance. To 
demonstrate the generality of the proof of Theorem 1 
we include a bound for Bernoulli bandits in the ap- 
pendix (see the supplementary material). Finally in 
Section 5 we consider the empirical performance of 
BayesGap as compared to a number of different ap- 
proaches from the literature on independent Gaussian 
arms, correlated Gaussian arms, and finally on a real- 
world sensor network optimization task. 

2. Problem formulation 

Consider a multi-armed bandit problem with a collection
of independent arms $\mathcal{A} = \{1, \dots, K\}$ such that
the immediate reward of pulling arm $k \in \mathcal{A}$ is characterized
by a distribution $\nu_k$ with mean $\mu_k$. Note that
the assumption of independence does not mean that 
the means of each arm cannot share some underly- 
ing structure — only that the act of pulling arm k does 
not affect the future rewards of pulling any other arm. 
This distinction will be relevant in Section 4.1. 

Next, we will let 

$$\Delta_k = \left| \max_{i \neq k} \mu_i - \mu_k \right| \qquad (1)$$

denote the gap between the mean of the $k$th arm and that of the best
alternative arm. For the optimal arm this coincides
with a measure of how optimal that arm is, whereas for
all other arms it is a measure of their sub-optimality.
Finally, we will also let $\mu^*$ denote the mean of the best arm, where
$k^*$ is its corresponding index.

The problem of identifying the best arm in a multi-armed
bandit process can now be introduced as a sequential
decision problem. At each round $t$ the decision
maker will select or "pull" an arm $a_t \in \mathcal{A}$ and
observe an independent sample $y_t$ drawn from the corresponding
distribution $\nu_{a_t}$. At the beginning of each
round $t$, the decision maker must decide which arm
to select based only on previous interactions, which
we will denote with the tuple $(a_{1:t-1}, y_{1:t-1})$. For any
arm $k$ we can also introduce the immediate, expected
regret of selecting that arm as

$$R_k = \mu^* - \mu_k, \qquad (2)$$

i.e. the difference between the expected reward of se- 
lecting arm k versus the reward of selecting the best 
arm. 
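
As a small illustration of definitions (1) and (2), the following sketch computes the gaps and immediate regrets for a hypothetical set of arm means. The function names and the numerical values are ours and serve only as an example.

```python
def gaps(means):
    """Gap of Eq. (1): |max_{i != k} mu_i - mu_k| for every arm k."""
    result = []
    for k, mu_k in enumerate(means):
        best_other = max(mu for i, mu in enumerate(means) if i != k)
        result.append(abs(best_other - mu_k))
    return result


def immediate_regrets(means):
    """Immediate regret of Eq. (2): mu* - mu_k for every arm k."""
    mu_star = max(means)
    return [mu_star - mu for mu in means]


# Hypothetical arm means, chosen only for illustration.
means = [0.5, 1.0, 0.75]
deltas = gaps(means)                 # gaps Delta_k
regrets = immediate_regrets(means)   # regrets R_k
```

For the best arm (index 1 in this example) the gap is the distance to the runner-up, whereas its immediate regret is zero.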

In standard bandit problems the goal is generally to 
minimize the cumulative sum of immediate regrets in- 
curred by the arm selection process. Instead, in this 
work we consider the pure exploration setting which
divides the sampling process into two phases: exploration
and testing. The exploration phase consists of
$T$ rounds wherein a decision maker interacts with the
bandit process by sampling arms. After these rounds,
the decision maker must make a single arm recommendation
$\Omega(T) \in \mathcal{A}$. The performance of the decision
maker is then judged only on the performance of this
recommendation strategy. The expected performance
of this single recommendation is known as the simple
regret, and we can write this quantity as $R_{\Omega(T)}$. Given
an $\epsilon > 0$ we can then define the probability of error as
the probability that $R_{\Omega(T)} > \epsilon$.

Somewhat surprisingly, (Bubeck et al., 2009) show
that any arm selection strategy that attains the optimal,
logarithmic cumulative regret, i.e. of order $\log(t)$,
obtains non-cumulative regret of order $t^{-\gamma}$ for some
$\gamma > 0$. As a result, the only way to obtain an exponentially
vanishing probability of error is to abandon the optimal
cumulative rate and explore more aggressively
(see Bubeck et al., 2009; Audibert et al., 2010). The
Bayesian gap-based approach, as well as the more general
gap-based approach we discuss, are both able to
obtain an exponentially vanishing rate, whereas standard
cumulative regret methods provably do not. The ex- 
perimental results on this behavior are somewhat more 
subtle, however, and we will return to this in Section 5. 

3. General gap-based exploration 

At the beginning of round $t$ we will assume that the decision
maker is equipped with upper and lower bounds
$U_k(t)$ and $L_k(t)$ on the mean reward for the $k$th arm.
For the time being we will make no assumption on
these bounds other than that with probability at least
$1 - \delta$ they bound the mean $\mu_k$, i.e.

$$\Pr\big(L_k(t) \le \mu_k \le U_k(t)\big) \ge 1 - \delta. \qquad (3)$$

We will also introduce the uncertainty diameter
$s_k(t) = U_k(t) - L_k(t)$ associated with each arm $k$.
Given bounds on the mean reward for each arm, we
can then introduce the gap quantity

$$B_k(t) = \max_{i \neq k} U_i(t) - L_k(t), \qquad (4)$$

which we can easily see involves a comparison between 
the lower bound of arm k and the highest upper bound 
among all alternative arms. Ultimately the arm selec- 
tion strategy will be based on this index. However, 
rather than directly finding the arm minimizing this 
gap, we will consider the two arms whose upper and 
lower bounds define this minimizer, namely 

$$J(t) = \arg\min_{k \in \mathcal{A}} B_k(t) \quad \text{and} \quad j(t) = \arg\max_{k \neq J(t)} U_k(t).$$



Algorithm 1 General gap-based exploration algorithm.

  init: select each arm once to obtain $(a_{1:K}, y_{1:K})$
  for $t = K+1, \dots$ do
    compute $L_k(t)$, $U_k(t)$, and $B_k(t)$ for each arm $k$
    $J(t) = \arg\min_{k \in \mathcal{A}} B_k(t)$
    $j(t) = \arg\max_{k \neq J(t)} U_k(t)$
    select arm $a_t = \arg\max_{k \in \{j(t), J(t)\}} s_k(t)$
    observe $y_t \sim \nu_{a_t}(\cdot)$
    break if termination condition is true
  end for
  return $\Omega(t)$



We can then select action $a_t$ such that

$$a_t = \arg\max_{k \in \{j(t), J(t)\}} s_k(t), \qquad (5)$$

i.e. between these two arms select the arm with the 
greatest uncertainty. 
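
The selection rule of Eqs. (4) and (5) can be sketched for a single round as follows. This is a minimal illustration that assumes the bounds are already available as plain lists; the function name `select_arm` and the example values are ours.

```python
def select_arm(lower, upper):
    """One round of gap-based selection.

    Returns (J, j, a, B_J): the proposal arm J(t), its strongest
    competitor j(t), the arm a_t to pull, and the gap index of J(t).
    """
    K = len(lower)
    # Gap index of Eq. (4): highest competing upper bound minus own lower bound.
    B = [max(upper[i] for i in range(K) if i != k) - lower[k] for k in range(K)]
    J = min(range(K), key=lambda k: B[k])
    j = max((k for k in range(K) if k != J), key=lambda k: upper[k])
    # Eq. (5): between J(t) and j(t), pull the arm with the larger diameter s_k(t).
    a = max((J, j), key=lambda k: upper[k] - lower[k])
    return J, j, a, B[J]


# Hypothetical bounds for three arms.
J, j, a, B_J = select_arm(lower=[0.0, 0.5, 0.25], upper=[1.0, 1.0, 0.75])
```

In this example arm 1 has the smallest gap index, arm 0 is its strongest competitor, and arm 0 is pulled because its interval is wider.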

In Algorithm 1 we show a general algorithm for the
problem of gap-based exploration. Note, however, that
we have not yet defined the termination condition or
the recommendation strategy $\Omega(t)$. In this work we
will consider the case of a fixed arm-selection budget, 
where the decision-maker must make exactly T arm 
queries. We note, however, that it is possible to extend 
these strategies to a setting where the decision maker 
can take as many actions as necessary to reach some 
desired certainty. The ability to handle both bounded 
horizon and bounded uncertainty is the main driver of 
the unified approach of (Gabillon et al., 2012); in this 
work we do not address the task of bounded confidence 
purely for reasons of simplicity. 

In the budgeted horizon setting the termination con- 
dition is simple: once t = T and the time-horizon is 
reached we must break out of the loop. We will then 
define the recommendation strategy as 

$$\Omega(T) = J\Big(\arg\min_{t \le T} B_{J(t)}(t)\Big). \qquad (6)$$

That is, we recommend the proposal arm $J(t)$ whose gap index
$B_{J(t)}(t)$ is smallest over all times $t \le T$. The reason
behind this particular choice is subtle and will become 
clear in the proof of Theorem 1 in the appendix. 
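
To make the fixed-budget procedure concrete, the following sketch combines Algorithm 1 with the recommendation rule (6). It is only an illustration under stated assumptions: the bounds are taken to be the empirical mean plus or minus `beta * sigma / sqrt(N_k)`, which is our placeholder and not the construction analysed in this paper, and `pull` is a hypothetical callback returning a reward for the chosen arm.

```python
import math
import random


def gap_explore(pull, K, T, beta, sigma):
    """Fixed-budget gap-based exploration; returns the recommended arm index.

    ASSUMPTION (ours): bounds are mean_k +/- beta * sigma / sqrt(N_k).
    Requires K >= 2 and T > K so that at least one proposal is recorded.
    """
    counts = [0] * K
    sums = [0.0] * K

    def observe(k):
        sums[k] += pull(k)
        counts[k] += 1

    for k in range(K):                      # init: pull every arm once
        observe(k)

    best_gap, recommendation = math.inf, None
    for t in range(K + 1, T + 1):
        half = [beta * sigma / math.sqrt(counts[k]) for k in range(K)]
        lower = [sums[k] / counts[k] - half[k] for k in range(K)]
        upper = [sums[k] / counts[k] + half[k] for k in range(K)]
        # Gap index B_k(t), Eq. (4).
        B = [max(upper[i] for i in range(K) if i != k) - lower[k]
             for k in range(K)]
        J = min(range(K), key=lambda k: B[k])
        j = max((k for k in range(K) if k != J), key=lambda k: upper[k])
        # Eq. (6): remember the proposal J(t) with the smallest B_J(t)(t) so far.
        if B[J] < best_gap:
            best_gap, recommendation = B[J], J
        # Eq. (5): among J(t) and j(t), pull the arm with the wider interval.
        observe(max((J, j), key=lambda k: half[k]))
    return recommendation


# Hypothetical three-armed Gaussian bandit with noise standard deviation 0.5.
rng = random.Random(0)
true_means = [0.1, 0.9, 0.5]
arm = gap_explore(lambda k: rng.gauss(true_means[k], 0.5),
                  K=3, T=200, beta=2.0, sigma=0.5)
```

The total number of pulls is exactly `T`: `K` initial pulls followed by one pull per round. Tracking the running minimum of the gap index avoids storing the full history of proposals.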

We will first define $N_k(t)$ as the number of times arm
$k$ has been pulled after $t$ rounds. Theorem 1 is most
powerful when the behaviour of $s_k(t)$ is known as a
function of $N_k(t-1)$. Since this exact relationship
can be hard to compute, we propose to use an upper
bound on $s_k(t)$ instead: the tighter the bound, the
better the result. We will let $g_k : \mathbb{N} \to \mathbb{R}^+$ be a strictly
monotonically decreasing function such that

$$s_k(t) \le g_k(N_k(t-1)). \qquad (7)$$

One important feature of $g_k$ that we exploit in Theorem 1
is that, as a result of being strictly monotonically decreasing,
it must be injective and therefore invertible.
In a slight abuse of notation we will let $g_k^{-1}$ denote the
left inverse of $g_k$. Often, for unstructured and independent
arms we can utilize the same bound $g_k = g$ for
each arm. This generalization does, however, allow us 
to incorporate model information on an arm-by-arm 
basis, as we will see in the next section. 

In order to properly explore the bandit problem we will 
need to control how many times each arm k is pulled 
based on how difficult it is to determine whether that 
arm is optimal with accuracy $\epsilon$. We will do so by
introducing an arm-dependent hardness quantity

$$H_{k\epsilon} = \max\big(\tfrac{1}{2}(\Delta_k + \epsilon), \epsilon\big), \qquad (8)$$

and $H_\epsilon = \sum_k H_{k\epsilon}^{-2}$ as a problem-dependent hardness
parameter associated with the bandit problem as a
whole. When it is not specified, the term hardness
refers to $H_\epsilon$. We will see this quantity reappear in a
number of problem-specific bounds shown in the next
section.
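
As a brief illustration of Eq. (8), the sketch below evaluates the per-arm quantities and the overall hardness for a hypothetical set of gaps. The function name and the numbers are ours.

```python
def hardness(deltas, eps):
    """Return ([H_{k,eps} for each arm], H_eps) following Eq. (8).

    Assumes eps > 0 (or all gaps > 0) so that no H_{k,eps} is zero.
    """
    per_arm = [max(0.5 * (d + eps), eps) for d in deltas]
    total = sum(h ** -2 for h in per_arm)
    return per_arm, total


# Hypothetical gaps (for example arm means 2.0, 1.5, 0.5) and accuracy eps = 0.5.
per_arm, H_eps = hardness([0.5, 0.5, 1.5], eps=0.5)
```

Arms with small gaps dominate the sum, which matches the intuition that near-optimal arms are the ones that make a problem hard.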

Theorem 1. Consider a bandit problem with horizon
$T$ and $K$ arms. Let $U_k(t)$ and $L_k(t)$ be upper
and lower bounds that hold for all times $t \le T$ and
all arms $k \le K$ with probability $1 - \delta$. Finally, let
$g_k$ be a monotonically decreasing bound on the confidence
diameter for arm $k$, as defined in (7), such that
$\sum_k g_k^{-1}(H_{k\epsilon}) \le T - K$. We can then bound the simple
regret as

$$\Pr\big(R_{\Omega(T)} \le \epsilon\big) \ge 1 - KT\delta. \qquad (9)$$

The result of Theorem 1 is general enough to accom- 
modate a large class of uncertainty models, whether 
frequentist or Bayesian. In addition, the theorem 
reduces the problem of proving a regret bound to 
that of checking a few properties of the uncertainty 
model. For example, by using Hoeffding or Bernstein 
bounds to define the confidence intervals we recover 
the bounds of (Gabillon et al., 2012). In the following 
section, we will apply this theorem to Bayesian Gaus- 
sian bandits in order to obtain a bound on the perfor- 
mance in terms of simple regret. We further demon- 
strate the flexibility of this theorem in the appendix 
where the same technique is applied to Bernoulli ban- 
dits. 



4. Bayesian gap-based exploration 

In this section, we will consider bandit problems
wherein the distribution of rewards for each arm $k$ is
assumed to depend on unknown parameters $\theta_k \in \Theta$.
We will write the density of each arm as $\nu_{\theta_k}$. When
considering the bandit problem from a Bayesian perspective,
we will assume a prior density $\theta_k \sim \pi_k^0(\cdot)$
from which the parameters for each arm are drawn.
After $t - 1$ rounds, let $\mathcal{T}_k(t-1) = \{n < t : a_n = k\}$
be the subset of past time indices such that arm
$k$ was selected. Given these indices, the posterior for
the parameters of arm $k$ can be written as

$$\pi_k^t(\theta_k) \propto \pi_k^0(\theta_k) \prod_{n \in \mathcal{T}_k(t-1)} \nu_{\theta_k}(y_n). \qquad (10)$$

From this we can also see that only if arm $k$ is selected
at time $t - 1$ does the posterior need to be updated,
as otherwise for all $k \neq a_{t-1}$ we have $\pi_k^t = \pi_k^{t-1}$.

We are, however, only partially interested in the posterior
distribution of the parameters $\theta_k$. Instead, we are
primarily concerned with the posterior distribution of
the mean reward for each arm

$$\mu_k = \mathbb{E}[Y \mid \theta_k] = \int y \, \nu_{\theta_k}(y) \, dy. \qquad (11)$$

Again, we note that we do not know the true value of
$\theta_k$; instead we only have access to the posterior distribution
over this variable at time $t$. As a result we only
have a distribution over $\mu_k$, induced by the distribution
over the parameters, which at time $t$ we can write
as $\rho_k^t(\mu_k)$.

In the subsections that follow we will use this poste- 
rior distribution to define upper and lower confidence 
bounds that both hold with high probability and give 
rise to a bound $g_k$ on the confidence diameter of the desired
form. As a result, we can derive high-probability
bounds on the simple regret which can easily take into 
account the structure of the problem. 

4.1. Gaussian arms 

Consider $K$ arms, such that each arm $k$ is associated
with a known vector $u_k \in \mathbb{R}^d$ and where the rewards
for pulling arm $k$ are normally distributed

$$\nu_k(y) = \mathcal{N}(y; u_k^T \theta, \sigma^2) \qquad (12)$$

with known noise variance $\sigma^2$ and unknown $\theta \in \mathbb{R}^d$. Note here
that the rewards for each arm are independent conditional
on $\theta$, but marginally dependent when this parameter
is unknown. In particular, the level of their
dependence depends on the structure of the vectors
$u_k$. By placing a prior $\theta \sim \mathcal{N}(0, \eta^2 I)$ over the entire
parameter vector, however, we can still compute a
posterior distribution over this unknown quantity.

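Let $X_t = [u_{a_1} \dots u_{a_{t-1}}]^T$ denote the design matrix and
$Y_t = [y_1 \dots y_{t-1}]^T$ the vector of observations at the
beginning of round $t$. We can then write the posterior
at time $t$ as $\pi^t(\theta) = \mathcal{N}(\theta; \hat\theta_t, \hat\Sigma_t)$, where

$$\hat\Sigma_t^{-1} = \sigma^{-2} X_t^T X_t + \eta^{-2} I, \qquad (13)$$
$$\hat\theta_t = \sigma^{-2} \hat\Sigma_t X_t^T Y_t. \qquad (14)$$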


From this formulation we can easily obtain that the expected
reward associated with arm $k$ is marginally normal,
$\rho_k^t(\mu_k) = \mathcal{N}(\mu_k; \hat\mu_k(t), \hat\sigma_k^2(t))$, with mean
$\hat\mu_k(t) = u_k^T \hat\theta_t$ and variance $\hat\sigma_k^2(t) = u_k^T \hat\Sigma_t u_k$. Note also that the
predictive distribution over rewards associated with
the $k$th arm is normal as well, with mean $\hat\mu_k(t)$ and
variance $\hat\sigma_k^2(t) + \sigma^2$.
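
The posterior update of Eqs. (13) and (14) and the induced marginals over the arm means can be sketched with NumPy as follows. This is an illustrative implementation under our own naming; it inverts the precision matrix directly, which is adequate for small $d$ but is not meant as a numerically tuned routine.

```python
import numpy as np


def posterior(X, Y, sigma2, eta2):
    """Posterior mean and covariance of theta, Eqs. (13)-(14)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / eta2
    Sigma = np.linalg.inv(precision)
    theta = Sigma @ X.T @ Y / sigma2
    return theta, Sigma


def arm_marginals(U, theta, Sigma):
    """Posterior mean and variance of mu_k = u_k^T theta for each row u_k of U."""
    mean = U @ theta
    var = np.einsum("kd,de,ke->k", U, Sigma, U)
    return mean, var


# Hypothetical example: two independent arms (the u_k are unit vectors);
# arm 0 was pulled twice with rewards 1.0 and 3.0, arm 1 was never pulled.
U = np.eye(2)
X = np.array([[1.0, 0.0], [1.0, 0.0]])
Y = np.array([1.0, 3.0])
theta, Sigma = posterior(X, Y, sigma2=1.0, eta2=1.0)
mean, var = arm_marginals(U, theta, Sigma)
```

In this example the unpulled arm keeps its prior variance $\eta^2 = 1$, whereas two pulls shrink the variance of the first arm to $1/3$.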

Finally, based on this posterior, we will introduce upper
and lower bounds given by

$$L_k(t) = \hat\mu_k(t) - \beta \hat\sigma_k(t), \qquad (15)$$
$$U_k(t) = \hat\mu_k(t) + \beta \hat\sigma_k(t), \qquad (16)$$



from which we can then claim the following: 

Corollary 1. Consider a $K$-armed Gaussian bandit
problem with horizon $T$ and let $U_k(t)$ and $L_k(t)$ be defined
as above. Let $\kappa = \sum_k \|u_k\|^{-2}$. Then for $\epsilon > 0$
and

$$\beta^2 = \frac{(T - K)/\sigma^2 + \kappa/\eta^2}{4 H_\epsilon},$$

the algorithm attains simple regret satisfying

$$\Pr\big(R_{\Omega(T)} \le \epsilon\big) \ge 1 - KT e^{-\beta^2/2}.$$
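
For concreteness, the exploration constant of Corollary 1 and the resulting bound on the failure probability can be evaluated as in the sketch below. The function names and the parameter values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def bayesgap_beta(T, K, sigma2, eta2, kappa, H_eps):
    """beta from Corollary 1: beta^2 = ((T-K)/sigma^2 + kappa/eta^2) / (4 H_eps)."""
    return math.sqrt(((T - K) / sigma2 + kappa / eta2) / (4.0 * H_eps))


def error_bound(T, K, beta):
    """Upper bound K*T*exp(-beta^2/2) on the failure probability, capped at 1."""
    return min(1.0, K * T * math.exp(-0.5 * beta ** 2))


# Hypothetical setting: 20 arms with unit-norm u_k, so kappa = 20.
beta = bayesgap_beta(T=100, K=20, sigma2=0.25, eta2=1.0, kappa=20.0, H_eps=85.0)
bound = error_bound(T=100, K=20, beta=beta)
```

With these values the bound is vacuous (it is capped at 1), which illustrates that the guarantee only becomes informative once the budget $T$ is large relative to the hardness $H_\epsilon$.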

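Proof. First, to simplify notation let $\lambda = \sigma/\eta$. Now,
using the definition of the posterior variance for arm
$k$ as $\hat\sigma_k^2(t)$, we can write the confidence diameter as

$$\begin{aligned}
s_k(t) &= 2\beta \sqrt{u_k^T \hat\Sigma_t u_k} \\
&= 2\beta \sqrt{\sigma^2 u_k^T \Big(\textstyle\sum_i N_i(t-1)\, u_i u_i^T + \lambda^2 I\Big)^{-1} u_k} \\
&\le 2\beta \sqrt{\sigma^2 u_k^T \big(N_k(t-1)\, u_k u_k^T + \lambda^2 I\big)^{-1} u_k}.
\end{aligned}$$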

In the second equality we decomposed the Gram matrix
$X_t^T X_t$ in terms of a sum of outer products over the
fixed vectors $u_i$. In the final inequality we noted that
by removing samples we can only increase the variance
term, i.e. here we have essentially replaced $N_i(t-1)$
with $0$ for $i \neq k$. We will let the result of this final
inequality define an arm-dependent bound $g_k$. Letting
$A = \lambda^2/N$ we can simplify this quantity using the
Sherman-Morrison formula as

$$\begin{aligned}
g_k(N) &= 2\beta \sqrt{(\sigma^2/N)\, u_k^T (u_k u_k^T + A I)^{-1} u_k} \\
&= 2\beta \sqrt{\frac{\sigma^2 \|u_k\|^2}{N A} \left(1 - \frac{\|u_k\|^2/A}{1 + \|u_k\|^2/A}\right)} \\
&= 2\beta \sqrt{\frac{\sigma^2 \|u_k\|^2}{\lambda^2 + N \|u_k\|^2}},
\end{aligned}$$

which is monotonically decreasing in $N$. The inverse
of this function can easily be solved for as

$$g_k^{-1}(s) = \frac{4(\beta\sigma)^2}{s^2} - \frac{\lambda^2}{\|u_k\|^2}.$$



By setting $\sum_k g_k^{-1}(H_{k\epsilon}) = T - K$ and solving for $\beta$
we then obtain the definition of this term given in the
statement of the corollary. Finally, by reference to
Lemma D1 we can see that for each $k$ and $t$, the upper
and lower bounds must hold with probability $1 - e^{-\beta^2/2}$.
These last two statements satisfy the assumptions of
Theorem 1, thus concluding our proof. □
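
The closed form for $g_k$ obtained in the proof, and its inverse, can be checked numerically against the direct matrix expression. The sketch below does this for a hypothetical vector $u_k$ and hypothetical parameter values; the function names are ours.

```python
import numpy as np


def g_direct(N, u, beta, sigma, eta):
    """2*beta*sqrt(sigma^2 u^T (N u u^T + lambda^2 I)^{-1} u), evaluated directly."""
    lam2 = (sigma / eta) ** 2
    M = N * np.outer(u, u) + lam2 * np.eye(len(u))
    return 2.0 * beta * np.sqrt(sigma ** 2 * (u @ np.linalg.solve(M, u)))


def g_closed(N, u, beta, sigma, eta):
    """Closed form after applying the Sherman-Morrison formula."""
    lam2 = (sigma / eta) ** 2
    norm2 = u @ u
    return 2.0 * beta * np.sqrt(sigma ** 2 * norm2 / (lam2 + N * norm2))


def g_inverse(s, u, beta, sigma, eta):
    """Inverse of g_closed with respect to N."""
    lam2 = (sigma / eta) ** 2
    return 4.0 * (beta * sigma) ** 2 / s ** 2 - lam2 / (u @ u)


# Hypothetical arm vector and parameters.
params = dict(u=np.array([1.0, 2.0]), beta=1.5, sigma=0.5, eta=1.0)
direct = g_direct(5, **params)
closed = g_closed(5, **params)
recovered_N = g_inverse(closed, **params)   # should recover N = 5 up to round-off
```

Agreement between `direct` and `closed` confirms the Sherman-Morrison step, and `recovered_N` confirms that the stated inverse undoes $g_k$.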

5. Experiments 

We can now turn to the problem of empirically com- 
paring our proposed algorithm, BayesGap, to several 
other approaches advocated in the literature for the 
linear-Gaussian model introduced in Section 4.1. In
particular we will consider the following approaches: 

1. UCBE: A highly exploring variant of the classical
   UCB policy of (Auer et al., 2002), introduced by
   (Audibert et al., 2010). This approach replaces
   the $\log(t)$ exploration term of UCB with a constant
   of order $\log(T)$ for known horizon $T$. This
   encourages the algorithm to explore much more
   aggressively.

2. UGap: The bounded-horizon¹ gap-based exploration
   approach introduced in (Gabillon et al.,
   2012), on which Section 3 is based. As mentioned
   in that section, the algorithm is based on
   using confidence bounds derived from a Hoeffding
   bound² around the mean.

3. BayesUCB: An adaptation of UCB to the Bayesian
   setting wherein the upper confidence bound is
   given by an upper quantile on the posterior mean
   (see Kaufmann et al., 2012a).

4. Thompson sampling: A randomized, Bayesian index
   strategy where the probability of selecting
   the $k$th arm is given by a single-sample Monte
   Carlo approximation to the posterior probability
   that arm $k$ is the best arm. See also (Chapelle &
   Li, 2012) for an empirical study, and (Kaufmann
   et al., 2012b) for a theoretical analysis.

¹ Technically this is UGapEb, denoting bounded horizon,
but as we do not consider the fixed-confidence variant
in this paper we simplify the acronym.

² The authors also introduced a variation, UGapV,
which uses tighter Bernstein bounds. However, in this paper
we restrict our comparison to UGap. We note also that
for bounded horizon problems in this earlier work, UGap
and UGapV obtained similar results.

Among these algorithms, (1-2) attack the pure ex- 
ploration problem, whereas (3-4) are optimal for cu- 
mulative regret problems. Algorithms (3-4) are also 
Bayesian approaches. For the cumulative regret ap- 
proaches we used as recommendation strategy the in- 
dex of the best posterior mean after T rounds. Note 
also that we did not compare against the classical UCB 
algorithm due to the fact that its bounds are sub- 
optimal compared to those of BayesUCB, a fact that 
is also borne out empirically (see the above citations) . 

We also note that in the linear-Gaussian setting we 
analyze here, the approach of BayesUCB — as pointed 
out by (Kaufmann et al., 2012a) — coincides with the 
linear optimization approach of (Dani et al., 2008) for 
a particular choice of uninformative prior. As a result, 
our method would also be able to take advantage of 
this prior structure by replacing the simpler $\mathcal{N}(0, \eta^2 I)$
prior to account for unknown variance. 

Finally, although (3-4) yield competitive results in 
the following experiments, the fact that these meth- 
ods achieve the optimal logarithmic cumulative re- 
gret bound (Kaufmann et al., 2012a) implies that 
they are provably sub-optimal in the simple regret set- 
ting (Bubeck et al., 2009). In contrast, BayesGap has 
provable, exponentially vanishing simple regret (see 
Corollary 1), along with UGap and UCBE. 

For the remainder of the section, unless otherwise specified,
the Bayesian methods are given a Gaussian prior
over $\theta$ with mean zero and variance $\eta^2 = 1$. The observation
model in the synthetic experiments is also given
to the methods and is set to a Gaussian with variance
$\sigma^2 = 0.25$.

For the simple regret approaches, we also give each 
algorithm the hardness estimate $H_\epsilon$. In practice one
would not know this parameter, and instead would 
have to estimate it in an adaptive fashion as done in 
(Audibert et al., 2010; Gabillon et al., 2011). We did 
not do this for simplicity and also for a more direct 
comparison with the results of the most closely related 
algorithm, that of (Gabillon et al., 2012). Finally, as 
is often the case in the literature (see the above citations),
we found that tuning the exploration parameter
of each of these approaches subtly improved their behavior.
In the case of BayesGap this amounts to replacing
$\beta$ with $c\beta$ for some constant $c > 0$, and similarly
for the other two methods. We should also note
that while UCBE and UGap assume bounded rewards, 
the use of an exploration multiplier c for both algo- 
rithms helps to correct for the fact that our rewards 
are actually unbounded. One could consider different 
bounding techniques in order to extend these models 
to unbounded rewards, but we should point out that 
moving into the Bayesian setting is one particular way 
to take into account rewards of this form. 

The next four sections are each dedicated to a separate 
experiment. In Section 5.1, we evaluate the sensitiv- 
ity of the tested methods to their hyper-parameters. 
In Section 5.2, we fix the hardness $H_\epsilon$ and evaluate
the methods for various horizons. The next two sections
introduce correlation via Gaussian process (GP)
techniques. In Section 5.3, we present an experiment
where the means $\mu_k$ are set by sampling a function
at discrete points, a function which was itself sampled
from a synthetic GP. In Section 5.4, we obtain $\mu_k$ similarly
from a GP that was fit to real data.

5.1. Sensitivity Analysis 

We first present experiments on a synthetic set of 
Gaussian arms in order to analyze the sensitivity 
of BayesGap, UCBE, and UGap to their hyper-parameters.
We considered $K = 20$ independent
Gaussian arms, and as noted earlier we set the
observation noise (likelihood) and prior variances to $\sigma^2 =
0.25$ and $\eta^2 = 1$ for BayesGap. We ran multiple experiments
by varying the hardness of the problem with
$H_\epsilon \in \{40, 80, 160, 320\}$; note that this choice of $H_\epsilon$
corresponds to four multiples of $K/\sigma^2$. We used this
choice of $H_\epsilon$ to set the arm means such that each run
includes an optimal arm of $\mu^* = \eta/4 + \sqrt{K/H_\epsilon}$ and all
other arms have mean $\mu = \eta/4$. We also varied the
time horizon, using $T \in \{25, 40, 80, 160\}$. Finally, for
each $(T, H_\epsilon)$ pair we studied the performance of the
three algorithms using a regular, logarithmic grid for 
the exploration parameter c. A few typical examples 
are reported in Figure 1. 

We gained a number of insights from these experiments.
First, BayesGap's optimal performance for
fixed $T$ and $H_\epsilon$ is comparable to, if not better than, that of
UGap and UCBE. In fact, for short horizons on "easy"
problems, BayesGap outperforms them significantly.
However, this is often a regime of generally poor performance
for all methods, as is illustrated in the two
lower-left plots of Figure 1. Second, for low values of
$c$, BayesGap is more sensitive to its parameter than its
competition, while for large enough $c$, the behaviour
is largely similar, as can be seen in the left column of
Figure 1. Finally, BayesGap's optimal $c$ value seems
to drift towards larger values as we look at larger horizons.
We found in most of our experiments that a
value in the range $[1, 8]$ was appropriate. Only when
we attempted to drastically increase the hardness of
our problem did larger values perform somewhat better.

[Figure 1: panels for T = 40 and T = 160; x-axis: exploration parameter c.]

Figure 1. Sensitivity of UCBE, UGap, and BayesGap to
the tuning parameter $c$ for varying horizons $T$ and hardness
quantities $H_\epsilon$.

5.2. Multiple Fixed Horizons 

In this experiment, we consider the bandit problem
from the previous subsection with $H_\epsilon = 1280$ and we
fix $c = 8$ for all methods. This value of the hyper-parameter
produced good results for all three methods
in the previous experiment. We also consider a harder
problem here than in the previous section to show the
behaviour of these algorithms for long horizons. We
then measure the performance of the algorithms with
this value for a large range of fixed horizons. It is important
to note that this is not showing the evolution
in time of the performance of a fixed algorithm. Rather,
each point in Figure 2 is an average over multiple independent
runs of the algorithm for a given horizon
$T$. Indeed, since $c$ and $H_\epsilon$ are held fixed while $T$ varies,
each horizon corresponds to a different exploration trade-off
and therefore to an entirely different instance of the algorithm. The main
point to take away from this figure is that BayesGap
performs as well as its competitors in the independent
arms setting. It is perhaps somewhat surprising how
well the cumulative regret methods do. This, however,
somewhat matches previous observations from the literature
in that the comparatively "poor" performance
of these methods may only appear for very hard instances
or in the asymptotic regime; see (Bubeck et al.,
2009, Figure 4 and discussion).

[Figure 2: probability of error versus horizon (0 to 2500).]

Figure 2. All the methods evaluated at different time horizons.
The exploration parameter for UCBE, UGap, and
BayesGap is set to $c = 8$. The probability of error is estimated
using Monte Carlo with $N = 1000$ samples. Error
bars show ±1 standard error.

Notice that for large horizons, BayesUCB and Thomp- 
son seem to slightly outperform BayesGap. Given 
our sensitivity analysis results, this is not surpris- 
ing, since for large horizons, the optimal c value for 
BayesGap exhibits a drift towards larger values. More- 
over, Corollary 1 above and Theorem 1 from (Gabillon 
et al., 2012) guarantee that the probability of error 
for BayesGap and UGap vanishes exponentially fast 
in the limit, respectively. Conversely, BayesUCB and 
Thompson are provably sub-optimal due to the previ- 
ously discussed result of (Bubeck et al., 2009). 

5.3. Application to a Synthetic Optimization 
Problem of Varying Correlation 

In this experiment, we study the effect of correlation 
among the arms. If the structure of the correlation is 
known, then each arm pull possibly provides informa- 
tion about more than one arm. If this information is 
used judiciously, greater performance can be attained. 






In order to create a problem with known, measurable correlation structure, we use a Gaussian process (see Rasmussen & Williams, 2005) evaluated at a discrete number of points using a squared-exponential kernel $k(x, x')$ with length-scale $\ell^2$. This allows us to control the correlation between arms by adjusting $\ell$, where larger $\ell$ corresponds to larger correlations. More precisely, given points $\{x_i\}$ we can construct a matrix $G$ such that $G_{ij} = k(x_i, x_j)$. The matrix $G$ is usually denoted $K$ in the GP literature; our notation departs from this standard in order to distinguish it from the number of arms $K$. Translating this into a Bayesian linear model is then simply a matter of taking the SVD, $G = V D V^T$ for diagonal $D$ and unitary $V$, and constructing the design matrix $U = V D^{1/2}$. We can then think of each arm $k$ as corresponding to a particular element in the set $\{x_i\}$. Then for $\theta \sim \mathcal{N}(0, \eta^2 I)$ we have that $\mu_k = u_k^T \theta$ is a sample from a Gaussian process with the given kernel and signal 



Figure 3. Probability of error for multiple fixed horizons with (top) $\ell = 0.25$ and (bottom) $\ell = 3.0$, corresponding to low and high correlation, respectively. The horizontal axis of each panel is the fixed horizon. The probability of error is estimated using $N = 1000$ Monte Carlo samples. Error bars show ±1 standard error. 



variance $\eta^2$. Essentially we are performing Bayesian optimization using a GP prior at a discrete number of points; see (Srinivas et al., 2010). In fact, in our setting BayesUCB corresponds to GP-UCB with a slightly different arm-selection mechanism. 
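
The construction of the design matrix from the kernel matrix can be sketched in a few lines. This is a minimal illustration only: the grid range, the signal scale `eta`, and the exact squared-exponential form $\exp(-(x - x')^2 / (2\ell^2))$ are assumptions made for the example, and the helper names `se_kernel` and `design_matrix` are hypothetical.

```python
import numpy as np

def se_kernel(x, length_scale):
    """Squared-exponential kernel matrix G_ij = k(x_i, x_j) on a 1-D grid."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def design_matrix(G):
    """Return U = V D^{1/2} from the SVD G = V D V^T, so that U U^T = G."""
    V, d, _ = np.linalg.svd(G)
    return V * np.sqrt(d)  # scales column j of V by sqrt(d_j)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)   # assumed grid: 30 arms on a 1-D interval
eta = 1.0                        # assumed prior standard deviation
G = se_kernel(x, length_scale=3.0)
U = design_matrix(G)
theta = rng.normal(0.0, eta, size=U.shape[1])  # theta ~ N(0, eta^2 I)
mu = U @ theta                   # mu_k = u_k^T theta, one mean per arm
```

Because $U U^T = G$, the vector `mu` has covariance $\eta^2 G$, which is why a draw of `theta` yields a draw from the Gaussian process restricted to the grid.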

A side effect of this type of correlation structure is that the best arm is now necessarily surrounded by very good arms. The gap between them decreases with increasing $\ell$, which in turn makes the problem harder. Given this correlation structure we next considered 30 arms, corresponding to 30 points on a simple 1D grid. For two length scales $\ell = 0.25$ and $\ell = 3.0$ we then performed $N = 1000$ runs of each bandit approach, with results averaged over individual samples $\theta$ from the prior. A plot similar to the one in the previous section is shown in Figure 3 for both of these length scales. In order to avoid diverging problem hardnesses, here we set $\epsilon = 0.1$ to upper bound the hardness of the problem by $K/\epsilon^2 = 3000$. This results in a value of $H_\epsilon$ which is higher than any problem we have considered so far, and at this level of difficulty we found that larger values of $c$ performed better, as was suggested by our discussion at the end of Section 5.1. 
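
The quoted hardness bound can be checked in one line. The sketch below assumes the per-arm hardness $H_{k\epsilon} = \max(\frac{1}{2}(\Delta_k + \epsilon), \epsilon)$ used in the proof of Theorem 1 and the aggregate hardness $H_\epsilon = \sum_k H_{k\epsilon}^{-2}$; the latter is inferred from the derivation in Appendix C rather than restated in this section.

```latex
H_{k\epsilon} \ge \epsilon
\quad\Longrightarrow\quad
H_\epsilon = \sum_{k=1}^{K} H_{k\epsilon}^{-2}
\le \frac{K}{\epsilon^{2}}
= \frac{30}{(0.1)^{2}} = 3000 .
```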

The results reported in Figure 3 show that, as ex- 
pected, for low correlation (top), BayesGap performs 
as well as the frequentist methods, while the other 
Bayesian methods lack exploration. Meanwhile, for 
high correlation (bottom), since the Bayesian meth- 
ods incorporate the correlation structure in their pos- 
terior, the less exploratory BayesUCB and Thompson 
now achieve good performance, whereas the frequen- 
tist methods are out-performed because they are ag- 
nostic to the correlation structure. Note that for both 
length scales, BayesGap is one of the best performing 
methods. 

5.4. Application to Real Data 

Finally, we take the same Gaussian process based approach from the previous set of experiments and apply it to real data. In particular we utilized data taken from traffic speed sensors deployed along highway I-880 South in California. This data was also used in (Srinivas et al., 2010). Traffic speed was collected for all working days between 6AM and 11AM for one month using 357 sensors. The goal is then to identify the single location with the highest expected speed, i.e. the least congested. 

Rather than specifying a kernel over the space of traffic 
sensor locations we follow the approach of (Srinivas 
et al., 2010) and construct the matrix by treating two- 
thirds of the data as historical data, letting the kernel 
matrix G then be given by the empirical covariance of 



Figure 4. Probability of error on the optimization domain of traffic speed sensors, shown for UCBE, UGap, BayesGap, BayesUCB, and Thompson. For this real data set, BayesGap provides considerable improvements over the Bayesian cumulative regret alternatives and the frequentist simple regret counterparts. 

this dataset. We then applied the same transformation as above to construct the design matrix $U$ for Bayesian linear regression. We also took the averaged variance of each individual sensor (i.e. the signal variance) and set the noise variance $\sigma^2$ of our process equal to 5% of this value (4.78). Given this historical data we also selected, somewhat arbitrarily, a broad prior of $\eta = 20$ (note that this data does correspond to traffic speeds in California). 

Finally, to evaluate this experiment we performed a single run of each bandit method using each of the remaining sensor signals as the mean vector $\mu$ with the given noise $\sigma$ and design matrix $U$. The results shown in Figure 4 are the probability of error for each method using a time horizon of $T = 400$. Here we used as $c$ for each of the relevant methods the value suggested from our earlier sensitivity experiments; for BayesGap this corresponded to $c = 8$. 

We can clearly see here the value of BayesGap in that it combines the pure exploration properties of UGap with the ability to take advantage of correlations via the Bayesian prior/posterior. 

6. Conclusion 

In this work we presented, to our knowledge, the first Bayesian approach to the problem of best arm identification with provable, exponentially vanishing, high probability regret bounds. In order to do so we built upon the earlier, gap-based approach of (Gabillon et al., 2011; 2012). We provided a key generalization to this earlier approach which accommodates Bayesian uncertainty models and non-symmetric confidence diameters, and which simplifies the process of proving model-specific simple regret bounds. We then applied this approach to the problem of linear-Gaussian bandits, and in the appendix we provide bounds for Bernoulli bandits. 

Our experiments show that our proposed method is competitive with existing approaches in both the presence and absence of correlated arm structure, and is always among the best exploration strategies. Further, where other state-of-the-art bandit techniques for linear bandits have provably sub-optimal simple regret bounds, our approach is able to obtain exponentially vanishing regret. Finally, we showed that in a realistic optimization example our method can significantly outperform other existing techniques. 

References 

J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm 
identification in multi-armed bandits. In Proceedings 
of the Conference on Learning Theory, 2010. 

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time 
analysis of the multiarmed bandit problem. Machine 
Learning, 47(2):235-256, 2002. 

D. Berry and B. Fristedt. Bandit Problems: Sequential 
Allocation of Experiments. Chapman & Hall, 1985. 

E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions with application to active user modeling and hierarchical reinforcement learning. eprint arXiv:1012.2599, arXiv, 2010. 

S. Bubeck, R. Munos, and G. Stoltz. Pure exploration 
in multi-armed bandits problems. In the Interna- 
tional Conference on Algorithmic Learning Theory, 
2009. 

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, 
and Games. Cambridge University Press, New York, 
2006. 

O. Chapelle and L. Li. An empirical evaluation of 
Thompson sampling. In Advances in Neural Infor- 
mation Processing Systems, 2012. 

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic Linear Optimization under Bandit Feedback. In Proceedings of the Conference on Learning Theory, pp. 355-366, 2008. 

N. de Freitas, A. J. Smola, and M. Zoghi. Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations. In International Conference on Machine Learning, 2012. 

V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, 2011. 

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best 
arm identification: A unified approach to fixed bud- 
get and fixed confidence. In Advances in Neural In- 
formation Processing Systems, 2012. 

J. Gittins, K. Glazebrook, and R. Weber. Multi-armed 
bandit allocation indices. Wiley, 2011. 

P. Hennig and C. J. Schuler. Entropy search for 
information-efficient global optimization. Journal of 
Machine Learning Research, 13:1809-1837, 2012. 

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, 2012a. 

E. Kaufmann, N. Korda, and R. Munos. Thomp- 
son sampling: an asymptotically optimal finite-time 
analysis. In the International Conference on Algo- 
rithmic Learning Theory, 2012b. 

R. Kohavi, R. Longbotham, D. Sommerfield, and R. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18:140-181, 2009. 

T. Lai and H. Robbins. Asymptotically efficient adap- 
tive allocation rules. Advances in Applied Mathe- 
matics, 6, 1985. 

C. Rasmussen and C. Williams. Gaussian Processes 
for Machine Learning. The MIT Press, 2005. 

H. Robbins. Some aspects of the sequential design of 
experiments. Bulletin of the American Mathematical 
Society, 55, 1952. 

D. Russo and B. Van Roy. Learning to optimize via posterior sampling. eprint arXiv:1301.2609, arXiv, 2013. 

S. Scott. A modern Bayesian look at the multi-armed 
bandit. Applied Stochastic Models in Business and 
Industry, 26(6), 2010. 

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. 
Gaussian process optimization in the bandit setting: 
No regret and experimental design. In International 
Conference on Machine Learning, 2010. 

W. Thompson. On the likelihood that one unknown 
probability exceeds another in view of the evidence 
of two samples. Biometrika, 25(3/4):285-294, 1933. 






A. Proof of Theorem 1 

Note that the proof of this section and the lemmas of the next section follow from the proofs of (Gabillon et al., 2012), but generalized to accommodate more general functions $g$ and nonsymmetric confidence diameters $s_k$. 

Proof. We will first define the event $\mathcal{E}$ such that on this event every mean is bounded by its associated bounds for all times $t$. More precisely we can write this as 

$$\mathcal{E} = \{\forall k \le K, \forall t \le T, \; L_k(t) \le \mu_k \le U_k(t)\}.$$

By definition, these bounds are given such that the probability of deviating from a single bound is $\delta$. Using a union bound we can then bound the probability of remaining within all bounds as $\Pr(\mathcal{E}) \ge 1 - KT\delta$. 
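
Written out, the union bound step reads as follows. This is a sketch under the assumption that, for each arm $k$ and time $t$, the mean falls outside $[L_k(t), U_k(t)]$ with probability at most $\delta$; $\mathcal{E}^c$ denotes the complement of the event.

```latex
\Pr(\mathcal{E}^{c})
= \Pr\Big( \bigcup_{k \le K} \bigcup_{t \le T}
  \big\{ \mu_k \notin [L_k(t), U_k(t)] \big\} \Big)
\le \sum_{k=1}^{K} \sum_{t=1}^{T} \delta
= KT\delta ,
\qquad\text{so}\qquad
\Pr(\mathcal{E}) \ge 1 - KT\delta .
```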

We will next condition on the event $\mathcal{E}$ and assume regret of the form $R_{\Omega(T)} > \epsilon$ in order to reach a contradiction. Upon reaching said contradiction we can then see that the simple regret must be bounded by $\epsilon$ with probability given by the probability of event $\mathcal{E}$, as stated above. As a result we need only show that a contradiction occurs. 

We will now define $\tau = \arg\min_{t \le T} B_{J(t)}(t)$ as the time at which the recommended arm attains the minimum bound, i.e. $\Omega(T) = J(\tau)$ as defined in (6). Let $t_k \le T$ be the last time at which arm $k$ is pulled. Note that each arm must be pulled at least once due to the initialization phase. We can then show the following sequence of inequalities: 

$$\min(0, s_k(t_k) - \Delta_k) + s_k(t_k) \ge B_{J(t_k)}(t_k) \quad (a)$$
$$\ge B_{\Omega(T)}(\tau) \quad (b)$$
$$\ge R_{\Omega(T)} \quad (c)$$
$$> \epsilon. \quad (d)$$

Of these inequalities, (a) holds by Lemma B3, (c) holds by Lemma B1, and (d) holds by our assumption on the simple regret. The inequality (b) holds due to the definition of $\Omega(T)$ and time $\tau$. Note that we can also write the preceding inequality as two cases 

$$s_k(t_k) > 2 s_k(t_k) - \Delta_k > \epsilon, \quad \text{if } \Delta_k > s_k(t_k);$$
$$2 s_k(t_k) - \Delta_k \ge s_k(t_k) > \epsilon, \quad \text{if } \Delta_k \le s_k(t_k).$$

This leads to the following bound on the confidence diameter, 

$$s_k(t_k) > \max(\tfrac{1}{2}(\Delta_k + \epsilon), \epsilon) = H_{k\epsilon},$$

which can be obtained by a simple manipulation of the above equations. More precisely we can notice that in each case, $s_k(t_k)$ upper bounds both $\epsilon$ and $\frac{1}{2}(\Delta_k + \epsilon)$, and thus it obviously bounds their maximum. 

Now, for any arm $k$ we can consider the final number of arm pulls, which we can write as 

$$N_k(T) = N_k(t_k - 1) + 1 \le g^{-1}(s_k(t_k)) + 1 < g^{-1}(H_{k\epsilon}) + 1.$$

This holds due to the definition of $g$ as a monotonic decreasing function, and the fact that we pull each arm at least once during the initialization stage. Finally, by summing both sides with respect to $k$ we can see that $\sum_k g^{-1}(H_{k\epsilon}) + K > T$, which contradicts our definition of $g$ in the Theorem statement. □ 

B. Lemmas 

In order to simplify notation in this section, we will first introduce $B(t) = \min_k B_k(t)$ as the minimum over all gap indices for any time $t$. We will also note that this term can be rewritten as 

$$B(t) = B_{J(t)}(t) = U_{j(t)}(t) - L_{J(t)}(t),$$

which holds due to the definitions of $j(t)$ and $J(t)$. 
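
The quantities used throughout this section can be computed directly from the upper and lower bounds. The sketch below is illustrative only: the function name `gap_step` and the numerical bounds are hypothetical, the definitions of the gap index, $J(t)$, $j(t)$ and the selection rule are taken from how they are used in the lemmas that follow, and ties are broken arbitrarily in favour of $j(t)$.

```python
def gap_step(U, L):
    """One selection step from upper/lower bounds U[k], L[k] (lists of floats)."""
    K = len(U)
    # Gap index B_k = max_{i != k} U_i - L_k.
    B = [max(U[i] for i in range(K) if i != k) - L[k] for k in range(K)]
    # J(t) minimises the gap index; j(t) has the highest upper bound among the rest.
    J = min(range(K), key=lambda k: B[k])
    j = max((k for k in range(K) if k != J), key=lambda k: U[k])
    # Confidence diameters s_k = U_k - L_k; pull whichever of j, J is more uncertain.
    s = [U[k] - L[k] for k in range(K)]
    pulled = j if s[j] >= s[J] else J
    return B, J, j, s, pulled

U = [0.9, 0.8, 0.5]   # hypothetical upper bounds
L = [0.6, 0.2, 0.1]   # hypothetical lower bounds
B, J, j, s, pulled = gap_step(U, L)
```

On this example the minimum index satisfies $B(t) = U_{j(t)}(t) - L_{J(t)}(t)$ and is no larger than the confidence diameter of the pulled arm, which is the content of Corollary B2.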

Lemma B1. For any sub-optimal arm $k \ne k^*$, any time $t \in \{1, \ldots, T\}$, and on event $\mathcal{E}$, the immediate regret of pulling that arm is upper bounded by the index quantity, i.e. $B_k(t) \ge R_k$. 

Proof. We can start from the definition of the bound and expand this term as 

$$B_k(t) = \max_{i \ne k} U_i(t) - L_k(t) \ge \max_{i \ne k} \mu_i - \mu_k = \mu^* - \mu_k = R_k.$$

The first inequality holds due to the assumption of event $\mathcal{E}$, whereas the following equality holds since we are only considering sub-optimal arms, for which the best alternative arm is obviously the optimal arm. □ 

Lemma B2. For any time $t$ let $k = a_t$ be the arm pulled, for which the following statements hold: 

if $k = j(t)$, then $L_{j(t)}(t) \le L_{J(t)}(t)$, 
if $k = J(t)$, then $U_{j(t)}(t) \le U_{J(t)}(t)$. 

Proof. We can divide this proof into two cases based on which of the two arms is selected. 

Case 1: let $k = j(t)$ be the arm selected. We will then assume that $L_{j(t)}(t) > L_{J(t)}(t)$ and show that this is a contradiction. By definition of the arm selection rule we know that $s_{j(t)}(t) \ge s_{J(t)}(t)$, from which we can easily deduce that $U_{j(t)}(t) > U_{J(t)}(t)$ by way of our first assumption. As a result we can see that 

$$B_{j(t)}(t) = \max_{j \ne j(t)} U_j(t) - L_{j(t)}(t) < \max_{j \ne J(t)} U_j(t) - L_{J(t)}(t) = B_{J(t)}(t).$$

This inequality holds due to the fact that arm $j(t)$ must necessarily have the highest upper bound over all arms. However, this contradicts the definition of $J(t)$ and as a result it must hold that $L_{j(t)}(t) \le L_{J(t)}(t)$. 

Case 2: let $k = J(t)$ be the arm selected. The proof follows the same format as that used for $k = j(t)$. □ 

Corollary B2. If arm $k = a_t$ is pulled at time $t$, then the minimum index is bounded above by the uncertainty of arm $k$, or more precisely 

$$B(t) \le s_k(t).$$

Proof. We know that $k$ must be restricted to the set $\{j(t), J(t)\}$ by definition. We can then consider the case that $k = j(t)$, and by Lemma B2 we know that this imposes an order on the lower bounds of each possible arm, allowing us to write 

$$B(t) \le U_{j(t)}(t) - L_{j(t)}(t) = s_{j(t)}(t),$$

from which our corollary holds. We can then easily see that a similar argument holds for $k = J(t)$ by ordering the upper bounds, again via Lemma B2. □ 

Lemma B3. On event $\mathcal{E}$, for any time $t \in \{1, \ldots, T\}$, and for arm $k = a_t$ the following bound holds on the minimal gap, 

$$B(t) \le \min(0, s_k(t) - \Delta_k) + s_k(t).$$

Proof. In order to prove this lemma we will consider a number of cases based on which of $k \in \{j(t), J(t)\}$ is selected and whether or not one or neither of these arms corresponds to the optimal arm $k^*$. Ultimately, this results in six cases, the first three of which are based on selecting arm $k = j(t)$. 

Case 1: consider $k^* = k = j(t)$. We can then see that the following sequence of inequalities holds, 

$$\mu_{(2)} \overset{(a)}{\ge} \mu_{J(t)} \overset{(b)}{\ge} L_{J(t)}(t) \overset{(c)}{\ge} L_{j(t)}(t) \overset{(d)}{\ge} \mu_k - s_k(t).$$

Here (b) and (d) follow directly from event $\mathcal{E}$ and (c) follows from Lemma B2. Inequality (a) follows trivially from our assumption that $k = k^*$, as a result of which $J(t)$ can only be as good as the 2nd-best arm. Using the definition of $\Delta_k$ and the fact that $k = k^*$, the above inequality yields 

$$s_k(t) - (\mu_k - \mu_{(2)}) = s_k(t) - \Delta_k \ge 0.$$

Therefore the min in the result of Lemma B3 vanishes and the result follows from Corollary B2. 

Case 2: consider $k = j(t)$ and $k^* = J(t)$. We can then write 

$$B(t) = U_{j(t)}(t) - L_{J(t)}(t) \le \mu_{j(t)} + s_{j(t)}(t) - \mu_{J(t)} + s_{J(t)}(t) \le \mu_k - \mu^* + 2 s_k(t),$$

where the first inequality holds from event $\mathcal{E}$, and the second holds because by definition the selected arm must have higher uncertainty. We can then simplify this as 

$$= 2 s_k(t) - \Delta_k,$$

and hence 

$$B(t) \le \min(0, s_k(t) - \Delta_k) + s_k(t),$$

where the last step invokes Corollary B2, i.e. $B(t) \le s_k(t)$, so that $B(t)$ is bounded by the smaller of $2 s_k(t) - \Delta_k$ and $s_k(t)$. 

Case 3: consider $k = j(t) \ne k^*$ and $J(t) \ne k^*$. We can then write the following sequence of inequalities, 

$$\mu_{j(t)} + s_{j(t)}(t) \overset{(a)}{\ge} U_{j(t)}(t) \overset{(b)}{\ge} U_{k^*}(t) \overset{(c)}{\ge} \mu^*.$$

Here (a) and (c) hold due to event $\mathcal{E}$ and (b) holds since by definition $j(t)$ has the highest upper bound other than $J(t)$, which in turn is not the optimal arm by assumption in this case. By simplifying this expression we obtain $s_k(t) - \Delta_k \ge 0$, and hence the result follows from Corollary B2 as in Case 1. 

Cases 4-6: consider $k = J(t)$. The proofs for these three cases follow the same general form as the above cases and are omitted. Cases 1 through 6 cover all possible scenarios and prove Lemma B3. □ 

C. Modelling Bernoulli arms 

Consider $K$ Bernoulli arms, each associated with an unknown parameter $\theta_k \in [0, 1]$, i.e. on pulling arm $k$ we receive reward $y = 1$ with probability $\theta_k$. The standard, conjugate prior for such models is the Beta distribution, and we will associate each arm with a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior. The posterior for this model is then also Beta distributed such that 

$$\pi_k^t(\theta_k) = \mathrm{Beta}(\theta_k; \alpha_k(t), \beta_k(t))$$

with initial parameters corresponding to those of the prior. If arm $a_{t-1} = k$ is pulled at time $t - 1$, the posterior can be updated with 

$$\alpha_k(t) = \alpha_k(t-1) + \mathbb{I}_1(y_{t-1}),$$
$$\beta_k(t) = \beta_k(t-1) + \mathbb{I}_0(y_{t-1}),$$

i.e. the parameters represent counts of successes and failures respectively. Note also that the distribution over the mean rewards is trivial in this situation as $\mu_k = \theta_k$ and as a result $\rho_k^t = \pi_k^t$. 
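
As a concrete illustration of the count-based update, the following sketch applies it to a short reward sequence. The uniform Beta(1, 1) prior, the reward sequence, and the function name `update_beta` are assumptions made only for this example.

```python
def update_beta(alpha, beta, y):
    """Posterior update for one Bernoulli observation y in {0, 1}."""
    # A success increments alpha, a failure increments beta.
    return alpha + (1 if y == 1 else 0), beta + (1 if y == 0 else 0)

alpha0, beta0 = 1, 1            # assumed prior pseudo-counts
alpha, beta = alpha0, beta0
for y in [1, 0, 1, 1]:          # hypothetical rewards observed from one arm
    alpha, beta = update_beta(alpha, beta, y)

posterior_mean = alpha / (alpha + beta)
num_pulls = alpha + beta - (alpha0 + beta0)   # counts minus pseudo-counts
```

The last line recovers the number of pulls by subtracting the prior pseudo-counts from the posterior parameters.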

In this model, since the posteriors are not symmetric about their mean, bounds based on the standard deviations do not necessarily represent fixed-confidence upper and lower bounds. Instead, we will use the $(1 - \gamma)$th quantile and the $\gamma$th quantile, respectively, where $0 < \gamma < \frac{1}{2}$ will be determined later. Let $Q$ be the quantile function defined such that for $X \sim \rho$, $\gamma = \Pr(X \le Q(\gamma; \rho))$. We can then write upper and lower bounds 

$$U_k(t) = Q(1 - \gamma; \rho_k^t),$$
$$L_k(t) = Q(\gamma; \rho_k^t).$$

Corollary C3. Consider a $K$-armed Bernoulli bandit problem with horizon $T$ and let $U_k(t)$ and $L_k(t)$ be defined as above. For $\epsilon > 0$ and a quantile parameter 

$$\gamma = \exp\{-\tfrac{1}{2}(T + K(N_0 - 2))/H_\epsilon\},$$

for $N_0 = \alpha_0 + \beta_0$, then the algorithm attains simple regret satisfying 

$$\Pr(R_{\Omega(T)} \le \epsilon) \ge 1 - 2KT\gamma.$$

Proof. We will first consider the lower quantile $L_k(t)$ associated with arm $k$ at time $t$. Let $d(x, y)$ denote the KL-divergence between two Bernoulli random variables with parameters $x$ and $y$ respectively. We will also define $a = \alpha_k(t) - 1$ and $n = \alpha_k(t) + \beta_k(t) - 1$. By directly applying Lemma D2 we can bound the lower quantile with 

$$L_k(t) \ge \operatorname*{argmin}_{x < (a+1)/n} \{x : \tfrac{\log(1/\gamma)}{n} \ge d(\tfrac{a+1}{n}, x)\},$$

we can then use Pinsker's inequality to lower-bound the KL-divergence with a quadratic term 

$$\ge \operatorname*{argmin}_{x < (a+1)/n} \{x : \tfrac{\log(1/\gamma)}{n} \ge 2(\tfrac{a+1}{n} - x)^2\},$$

and finally we can lower-bound the quadratic term with the same quadratic shifted towards zero 

$$\ge \operatorname*{argmin}_{x < a/n} \{x : \tfrac{\log(1/\gamma)}{n} \ge 2(\tfrac{a}{n} - x)^2\} = x^-.$$

Note also that we have restricted the set of possible bounds to be $x < a/n < (a+1)/n$. Finally, using a similar application of Lemma D2 and Pinsker's inequality we can bound the upper quantile as 

$$U_k(t) \le \operatorname*{argmax}_{x > a/n} \{x : \tfrac{\log(1/\gamma)}{n} \ge 2(\tfrac{a}{n} - x)^2\} = x^+.$$

A graphical illustration of the techniques used to obtain these bounds for the lower quantile is shown in Figure 5. 

Figure 5. Illustration of the techniques used for bounding the lower quantile $L_k(t)$ from below. 

We can easily see that the possible values for both $x^-$ and $x^+$ are the values of $x$ on either side of $a/n$ for which the quadratic term $2(a/n - x)^2$ is below $\log(1/\gamma)/n$. The bounds must then be given by the two values of $x$ for which this quadratic term is greatest, i.e. $a/n \pm \sqrt{\log(1/\gamma)/(2n)}$. We can then define the confidence diameter as 

$$s_k(t) \le x^+ - x^- = \sqrt{\frac{2\log(1/\gamma)}{\alpha_k(t) + \beta_k(t) - 1}}.$$

Given the form of the Bayesian updates for this model, we know that the parameters for each model are the success and failure counts plus the pseudo-counts. As a result we can obtain the arm pull counts by subtracting the pseudo-counts, i.e. 

$$N_k(t) = \alpha_k(t) + \beta_k(t) - N_0.$$

We can then rewrite the confidence diameter in order to define $g(N)$ as 

$$s_k(t) \le \sqrt{\frac{2\log(1/\gamma)}{N_k(t) + N_0 - 1}} = g(N_k(t)).$$

The bounding function $g$ is strictly monotonically decreasing and is invertible with 

$$g^{-1}(s) = 1 - N_0 + \frac{2\log(1/\gamma)}{s^2}.$$

By setting $\sum_{k=1}^{K} g^{-1}(H_{k\epsilon}) = T - K$ and solving for $\gamma$, we arrive at the definition of $\gamma$ given in the statement of this corollary. Finally, we can easily see by the definition of the quantile function that $U_k(t)$ and $L_k(t)$ bound the expected reward $\mu_k$ with probability $1 - 2\gamma$ for all $k$ and $t$. These last two remarks satisfy the conditions of Theorem 1, thus completing our proof. □ 
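
The choice of quantile parameter can be checked numerically. The sketch below assumes $H_{k\epsilon} = \max(\frac{1}{2}(\Delta_k + \epsilon), \epsilon)$, $H_\epsilon = \sum_k H_{k\epsilon}^{-2}$, and $g^{-1}(s) = 1 - N_0 + 2\log(1/\gamma)/s^2$; the gap values, horizon, and function names are hypothetical and chosen only for illustration.

```python
import math

def hardness_terms(gaps, eps):
    """H_{k,eps} = max((Delta_k + eps) / 2, eps) for each arm."""
    return [max(0.5 * (d + eps), eps) for d in gaps]

def quantile_parameter(gaps, eps, T, n0):
    """gamma = exp(-(T + K (N0 - 2)) / (2 H_eps)) with H_eps = sum_k H_k^-2."""
    K = len(gaps)
    H_eps = sum(h ** -2 for h in hardness_terms(gaps, eps))
    return math.exp(-0.5 * (T + K * (n0 - 2)) / H_eps)

def g_inverse(s, gamma, n0):
    """Inverse of g(N) = sqrt(2 log(1/gamma) / (N + N0 - 1))."""
    return 1 - n0 + 2 * math.log(1 / gamma) / s ** 2

gaps = [0.2, 0.2, 0.4, 0.5]   # hypothetical gaps Delta_k
eps, T, n0 = 0.1, 400, 2      # hypothetical settings, N0 = alpha0 + beta0
gamma = quantile_parameter(gaps, eps, T, n0)
budget = sum(g_inverse(h, gamma, n0) for h in hardness_terms(gaps, eps))
```

With this value of `gamma`, the summed inverse bounds in `budget` equal $T - K$ up to floating-point error, which is the condition the proof solves for.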

D. Model-specific lemmas 

Lemma D1. Consider a normally distributed random variable $X \sim \mathcal{N}(\mu, \sigma^2)$ and $\beta > 0$. The probability that $X$ is within a radius of $\beta\sigma$ from its mean can then be written as 

$$\Pr(|X - \mu| \le \beta\sigma) \ge 1 - e^{-\beta^2/2}.$$

Proof. This proof follows from that of (Srinivas et al., 2010). Consider $Z \sim \mathcal{N}(0, 1)$. The probability that $Z$ exceeds some positive bound $c > 0$ can be written 

$$\Pr(Z > c) = \frac{e^{-c^2/2}}{\sqrt{2\pi}} \int_c^\infty e^{-(z-c)^2/2 - c(z-c)} \, dz \le \frac{e^{-c^2/2}}{\sqrt{2\pi}} \int_c^\infty e^{-(z-c)^2/2} \, dz = \tfrac{1}{2} e^{-c^2/2}.$$

The inequality holds due to the fact that $e^{-c(z-c)} \le 1$ for $z \ge c$. Using a union bound we can then bound both sides as $\Pr(|Z| > c) \le e^{-c^2/2}$. Finally, by setting $Z = (X - \mu)/\sigma$ and $c = \beta$ we obtain the bound stated above. □ 
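
The two-sided Gaussian tail bound is easy to verify numerically. The sketch below compares the exact tail probability, computed from the complementary error function, with the bound $e^{-c^2/2}$; the test values of $c$ and the function names are arbitrary choices for illustration.

```python
import math

def two_sided_tail(c):
    """Exact Pr(|Z| > c) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))

def lemma_bound(c):
    """Upper bound exp(-c^2 / 2) on the two-sided tail."""
    return math.exp(-0.5 * c * c)

# Each entry is (c, exact tail, bound); the exact tail should never exceed the bound.
checks = [(c, two_sided_tail(c), lemma_bound(c)) for c in (0.5, 1.0, 2.0, 3.0)]
```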

Lemma D2. This proof follows a similar structure to that of (Kaufmann et al., 2012a). Consider $X \sim \mathrm{Beta}(a, b)$ for integers $a$ and $b$. Let $d(x, y)$ denote the KL-divergence between two Bernoulli random variables with parameters $x$ and $y$ respectively. For $\gamma < 0.5$, let $q_\gamma = Q(\gamma; \mathrm{Beta}(a, b))$ denote the lower $\gamma$th quantile for $X$ and similarly define the upper quantile as $q_{1-\gamma}$. These quantiles can then be bounded as follows: 

$$q_\gamma \ge \operatorname*{argmin}_{x < \frac{a}{a+b-1}} \{x : \tfrac{\log(1/\gamma)}{a+b-1} \ge d(\tfrac{a}{a+b-1}, x)\},$$
$$q_{1-\gamma} \le \operatorname*{argmax}_{x > \frac{a-1}{a+b-1}} \{x : \tfrac{\log(1/\gamma)}{a+b-1} \ge d(\tfrac{a-1}{a+b-1}, x)\}.$$

Proof. We will first note that for a collection of $a + b - 1$ standard, uniform random variables the $a$th order statistic has distribution $\mathrm{Beta}(a, b)$. Letting $S_{n,x}$ denote a binomial random variable with $n$ trials and success probability $x$, we can think of $S_{a+b-1,x}$ as the number of uniform variates that are smaller than $x$. 

Based on this observation we can relate the cumulative distribution of $X$ to the number of uniform random variables that fall below $x$, i.e. 

$$\Pr(X \le x) = \Pr(S_{a+b-1,x} \ge a) \le \exp\{-(a+b-1)\, d(\tfrac{a}{a+b-1}, x)\}.$$

This last inequality holds by Sanov's theorem for $x < a/(a+b-1)$. We can then note the following sequence of implications: 

$$\exp\{-(a+b-1)\, d(\tfrac{a}{a+b-1}, x)\} \le \gamma \;\Rightarrow\; \Pr(X \le x) \le \Pr(X \le q_\gamma) \;\Rightarrow\; x \le q_\gamma.$$

The last implication follows from the fact that $\Pr(X \le x)$ increases monotonically in $x$. As a result we can find the minimum value of $x$ for which the first inequality holds, which leads to the lower bound given in the statement of this lemma. 

Bounding the upper quantile relies on a very similar sequence of steps. We can first relate the probability that $X$ exceeds some $x$ to a binomial random variable 

$$\Pr(X \ge x) = \Pr(S_{a+b-1,x} \le a - 1) = \Pr(S_{a+b-1,1-x} \ge b) \le \exp\{-(a+b-1)\, d(\tfrac{b}{a+b-1}, 1-x)\} = \exp\{-(a+b-1)\, d(\tfrac{a-1}{a+b-1}, x)\}.$$

The final equality follows from the fact that the KL-divergence satisfies $d(y, 1-x) = d(1-y, x)$. Note also that, as before, the inequality follows from Sanov's theorem, this time for $x > (a-1)/(a+b-1)$. Finally we can then note the following sequence of implications 

$$\exp\{-(a+b-1)\, d(\tfrac{a-1}{a+b-1}, x)\} \le \gamma \;\Rightarrow\; \Pr(X \ge x) \le \Pr(X \ge q_{1-\gamma}) \;\Rightarrow\; x \ge q_{1-\gamma}.$$

Here the final implication holds due to the fact that $\Pr(X \ge x)$ decreases monotonically in $x$. From here we can take the maximum $x$ for which this bound holds in order to obtain the desired bound on the upper quantile. □ 


